Miko Planas

Capability

Focuses on whether a model is capable of performing a given task.

Advantages
- Fairly simple to implement
- Sufficient for now
Disadvantages
- Susceptible to deceptive models
- May not be enough in the future
- Easy to lock ourselves into a particular eval framework that becomes mainstream but is insufficient
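
As a rough illustration of why capability evaluations count as "fairly simple", here is a minimal sketch of a pass-rate harness. Everything in it is hypothetical: `capability_pass_rate`, `toy_model`, and the two arithmetic tasks are made-up stand-ins, not part of any real eval framework.

```python
from typing import Callable, List, Tuple

# A task is a prompt plus a checker that decides whether the output passes.
Task = Tuple[str, Callable[[str], bool]]


def capability_pass_rate(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose model output passes its checker."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model: it can answer one question only.
    return "4" if prompt == "2 + 2 = ?" else "unknown"


tasks: List[Task] = [
    ("2 + 2 = ?", lambda out: out.strip() == "4"),
    ("3 * 3 = ?", lambda out: out.strip() == "9"),
]

rate = capability_pass_rate(toy_model, tasks)
print(f"pass rate: {rate:.2f}")
```

A harness like this only scores observed behavior, so a model that deliberately underperforms would get a low score despite having the capability. That is the "susceptible to deceptive models" disadvantage listed above.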

Alignment

Focuses on the propensity of a model to use dangerous capabilities.

Disadvantages
- Generally difficult to implement, since alignment failure modes are elusive
Notes
- Does not necessarily mean mechanistic interpretability
Understanding-based Evaluations that are currently insufficient, but may be starting points:
- Causal Scrubbing
  - A principled approach to evaluating the quality of mechanistic interpretations
  - A systematic ablation method for testing precisely stated hypotheses about how a particular neural network implements a behavior on a dataset
- Auditing Games
  - A technique for evaluating interpretability tools, not a technique for evaluating the extent to which we understand a model
- Prediction-based Evaluation
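
To make the causal scrubbing bullet above more concrete, here is a toy sketch of a resampling ablation, the basic operation that causal scrubbing builds on. It is not the full causal scrubbing algorithm. The two-unit "network", the dataset, and the names `hidden`, `readout`, and `scrubbed_loss` are all assumptions made up for illustration. The idea: a hypothesis claims certain units do not matter for the behavior, so we overwrite those units with their activations on a randomly chosen other input and check how much the loss changes.

```python
import random


def hidden(x):
    # Toy hidden layer: one unit per input feature.
    return [x[0], x[1]]


def readout(h):
    # Toy readout: unit 0 carries almost all of the signal.
    return 2.0 * h[0] + 0.1 * h[1]


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def scrubbed_loss(dataset, targets, irrelevant_units, rng):
    """Loss after resampling the units a hypothesis claims are irrelevant."""
    preds = []
    for x in dataset:
        h = hidden(x)
        donor = hidden(rng.choice(dataset))  # activations on a random dataset input
        for u in irrelevant_units:
            h[u] = donor[u]  # overwrite the "irrelevant" unit
        preds.append(readout(h))
    return mse(preds, targets)


rng = random.Random(0)
dataset = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
targets = [2.0 * x[0] for x in dataset]  # the behavior being explained

baseline = mse([readout(hidden(x)) for x in dataset], targets)
# Hypothesis A: unit 1 is irrelevant. Hypothesis B: unit 0 is irrelevant.
loss_a = scrubbed_loss(dataset, targets, irrelevant_units=[1], rng=rng)
loss_b = scrubbed_loss(dataset, targets, irrelevant_units=[0], rng=rng)
print(f"baseline={baseline:.4f}  hypothesis A={loss_a:.4f}  hypothesis B={loss_b:.4f}")
```

In this toy setup, scrubbing under hypothesis A should leave the loss close to the baseline, while scrubbing under hypothesis B should raise it sharply. That gap is the kind of signal an ablation-based test uses to accept or reject a precisely stated hypothesis.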